Deep learning applied to EEG source-data reveals both ventral and dorsal visual stream involvement in holistic processing of social stimuli

Perception of social stimuli (faces and bodies) relies on “holistic” (i.e., global) mechanisms, as supported by picture-plane inversion: perceiving inverted faces/bodies is harder than perceiving their upright counterparts. Although neuroimaging evidence suggests the involvement of face-specific brain areas in holistic processing, their spatiotemporal dynamics and selectivity for social stimuli are still debated. Here, we investigate the spatiotemporal dynamics of holistic processing for faces, bodies and houses (adopted as a control non-social category) by applying deep learning to high-density electroencephalographic (EEG) signals at the source level. Convolutional neural networks were trained to classify cortical EEG responses by stimulus orientation (upright/inverted), separately for each stimulus type (faces, bodies, houses); they performed well above chance for faces and bodies, and close to chance for houses. By explaining the network decisions, the 150–200 ms time interval and a few visual ventral-stream regions were identified as the most relevant for discriminating face and body orientation (lateral occipital cortex and, for faces only, precuneus cortex, fusiform and lingual gyri), together with two additional dorsal-stream areas (superior and inferior parietal cortices). Overall, the proposed approach is sensitive in detecting cortical activity underlying perceptual phenomena and, by maximally exploiting the discriminant information contained in the data, may reveal previously undisclosed spatiotemporal features, stimulating novel investigations.

Section S1. Convolutional neural network
Table S1 reports the parameters defining the network (i.e., hyper-parameters), together with the number of parameters to fit (i.e., "trainable" parameters) introduced by each layer and the shape of each layer's output.
Table S1. Details of the convolutional neural network. Each layer is listed with its name, main hyper-parameters, number of trainable parameters and output shape. Where not specified, stride and padding were set to (1,1) and (0,0), respectively. The total number of trainable parameters was 1502.

The input layer simply replicates the input neural activity in a single feature map; thus, the output shape of this layer is (1,68,100). The first convolutional layer then performs 2-D convolution in the temporal domain using 4 temporal kernels of size (1,49) (thus capturing frequency information at 4 Hz and above [1]), with unitary stride and zero-padding chosen so that the output shape of each feature map matches the input shape, i.e., a padding of (0,24). Neuron activations were then normalized via batch normalization [2]. The second convolutional layer performs 2-D depthwise convolution [3] in the spatial domain, learning a set of 2 spatial kernels for each temporally filtered version of the input (8 in total, given the 4 temporal kernels), each of size (68,1) (that is, spanning all 68 ROIs and thus learning the optimal combination across them), with unitary stride and no padding. Neuron activations were then normalized via batch normalization, passed through a ReLU nonlinearity, and downsampled in time using an average pooling layer with pooling size (1,10) and pooling stride (1,2), which is equivalent to applying a moving average along the time axis within windows of 5 ms with a step of 1 ms. Then, neuron activations were randomly dropped during training with a dropout probability of 0.25. The resulting activations were flattened into a 1-D array and provided as input to the fully-connected layer with 2 neurons, which outputs one class score per class (classes indexed 0 and 1). Finally, class scores were converted into conditional probabilities using the softmax activation function.
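The layer-by-layer description above can be checked with a few lines of arithmetic. The sketch below is only illustrative bookkeeping, not the authors' code: it assumes that the convolutions carry no bias term, that each batch-normalization layer learns one scale and one shift per feature map, and that the temporal padding is 24 samples per side. Under these assumptions the per-layer counts add up to the reported total of 1502 trainable parameters. All variable names are hypothetical.

```python
# Shape and parameter bookkeeping for the CNN described above.
# Assumptions (not stated in the text): convolutions carry no bias term,
# and each batch-normalization layer learns one scale and one shift per map.

def out_len(size, kernel, stride=1, padding=0):
    """Output length along one axis for a convolution or pooling layer."""
    return (size + 2 * padding - kernel) // stride + 1

n_rois, n_samples = 68, 100        # input shape (1, 68, 100)
n_temporal, temporal_len = 4, 49   # temporal kernels of size (1, 49)
spatial_per_map = 2                # spatial kernels per temporal feature map
pool_len, pool_stride = 10, 2      # average pooling (1, 10), stride (1, 2)
n_classes = 2

# Temporal convolution: padding of 24 samples keeps the time axis at 100.
t_conv = out_len(n_samples, temporal_len, padding=24)
conv1_params = n_temporal * temporal_len

# Depthwise spatial convolution: kernel (68, 1) collapses the ROI axis.
n_maps = n_temporal * spatial_per_map
rows = out_len(n_rois, n_rois)
conv2_params = n_maps * n_rois

# Average pooling along time, then flattening.
t_pool = out_len(t_conv, pool_len, stride=pool_stride)
n_flat = n_maps * rows * t_pool

layers = [
    ("temporal conv", conv1_params, (n_temporal, n_rois, t_conv)),
    ("batch norm 1", 2 * n_temporal, (n_temporal, n_rois, t_conv)),
    ("depthwise conv", conv2_params, (n_maps, rows, t_conv)),
    ("batch norm 2", 2 * n_maps, (n_maps, rows, t_conv)),
    ("average pooling", 0, (n_maps, rows, t_pool)),
    ("fully connected", n_flat * n_classes + n_classes, (n_classes,)),
]
total_params = sum(params for _, params, _ in layers)

for name, params, shape in layers:
    print(f"{name:<16} {params:>5}  {shape}")
print(f"{'total':<16} {total_params:>5}")
```

The fully-connected layer dominates the budget here (738 of the 1502 parameters), which is consistent with the emphasis on keeping the number of feature maps small.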
By limiting the number of layers and of learned features (e.g., only 4 temporal kernels and 8 spatial kernels overall) and by including depthwise convolutions, which are specifically aimed at reducing the number of trainable parameters, the adopted CNN introduced only 1502 trainable parameters. To optimize these parameters, the cross-entropy between the empirical probability distribution (defined by the training labels) and the model probability distribution (defined by the CNN outputs) was used as the loss function, and it was minimized using the adaptive moment estimation (Adam) algorithm [4] with a mini-batch size of 64, a learning rate of 10 &' and the other parameters set as in the default implementation [5]. CNNs were trained for a maximum of 500 epochs, and training was stopped early when the validation loss did not decrease for 100 consecutive epochs (this value was chosen empirically, based on the convergence speed of the algorithm), as also done in [6,7].
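The stopping rule described above (at most 500 epochs, with training halted after 100 consecutive epochs without a decrease in validation loss) can be sketched as follows. This is a minimal illustration operating on a precomputed list of validation losses; the function name and its return values are hypothetical, and tracking the best epoch is an assumption, since the text does not state which model checkpoint was retained.

```python
def early_stopping_epoch(val_losses, max_epochs=500, patience=100):
    """Return (last_epoch, best_epoch), both 1-indexed.

    Training stops at max_epochs, or as soon as the validation loss has
    not decreased for `patience` consecutive epochs.
    """
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            # Strict decrease: record the new best and reset the counter.
            best_loss, best_epoch = loss, epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                break
    return last_epoch, best_epoch


# Synthetic example: the loss improves for 50 epochs, then plateaus.
plateau = [1.0 / e for e in range(1, 51)] + [0.5] * 450
print(early_stopping_epoch(plateau))
```

In the synthetic example the last improvement occurs at epoch 50, so the rule halts training 100 epochs later.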
The main hyper-parameters (e.g., the number of temporal kernels, temporal kernel size, the number of spatial kernels, and the pooling size and stride) were selected via empirical evaluations during preliminary analyses.
CNNs were developed in Python (version 3.8.10) and trained with PyTorch (version 1.9.0) [5]; network decisions were explained with Captum (version 0.5.0) [8]. All computations were run on a workstation equipped with an AMD Threadripper 1900X CPU, an NVIDIA TITAN V GPU and 48 GB of RAM.
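For readers who wish to reproduce the architecture of Section S1, one possible PyTorch formulation is sketched below. It is not the authors' implementation: the class name `SourceEEGNet` is hypothetical, the bias-free convolutions and the temporal padding of (0,24) are assumptions chosen to be consistent with the reported total of 1502 trainable parameters, and the network returns raw class scores, with the softmax applied separately.

```python
import torch
from torch import nn


class SourceEEGNet(nn.Module):
    """Hypothetical sketch of the CNN described in Section S1."""

    def __init__(self, n_rois=68, n_samples=100, n_classes=2):
        super().__init__()
        self.features = nn.Sequential(
            # Temporal convolution: 4 kernels of size (1, 49), "same" padding.
            nn.Conv2d(1, 4, kernel_size=(1, 49), padding=(0, 24), bias=False),
            nn.BatchNorm2d(4),
            # Depthwise spatial convolution: 2 kernels per temporal map.
            nn.Conv2d(4, 8, kernel_size=(n_rois, 1), groups=4, bias=False),
            nn.BatchNorm2d(8),
            nn.ReLU(),
            nn.AvgPool2d(kernel_size=(1, 10), stride=(1, 2)),
            nn.Dropout(p=0.25),
        )
        n_flat = 8 * ((n_samples - 10) // 2 + 1)
        self.classifier = nn.Linear(n_flat, n_classes)

    def forward(self, x):
        # x has shape (batch, 1, n_rois, n_samples); returns class scores.
        x = self.features(x)
        return self.classifier(torch.flatten(x, start_dim=1))


model = SourceEEGNet()
n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)

model.eval()  # disables dropout and uses batch-norm running statistics
with torch.no_grad():
    scores = model(torch.zeros(3, 1, 68, 100))
    probs = torch.softmax(scores, dim=1)

print(n_params, tuple(scores.shape))
```

Setting `groups=4` in the second convolution is what makes it depthwise: each of the 4 temporal feature maps is filtered by its own 2 spatial kernels rather than being mixed with the others.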

Section S2. Layer-wise relevance propagation
Layer-wise relevance propagation [9] propagates the prediction of the network, represented by the class score (e.g., the predicted score assigned by the network to the inverted orientation, see Eq.